Principal Components, Partial Least Squares, Regularization, and Re-Sampling
University of Zurich
Irreducible error: error that cannot be eliminated, not even through fine-tuning or choosing a different algorithm.
Captured by \(\varepsilon\).
Sources include
Missing predictive features.
Measurement error.
Random shocks.
The remedy is to collect more and better data.
Bias means that we systematically err in our predictions.
Common causes include:
Stopping training too early—especially relevant for complex algorithms such as neural networks.
Under-fitting the data.
An under-fit means that we miss elements of the data generating process—our model is not sufficiently complex.
Remedies include letting the training process run its course and increasing model complexity.
The model does not generalize well to new data; small changes in the data have large consequences for performance.
The major cause of variance error is over-fitting, which means any of the following:
The model is too complex.
The model is too flexible.
We capitalize on chance.
The remedy is to reduce model complexity.
Source: Seema Singh
Total error is \(B^2 + V + I\): squared bias plus variance plus irreducible error.
To reduce \(B\), we may have to increase model complexity, but this could increase \(V\).
To reduce \(V\), we may have to reduce model complexity, but this could increase \(B\).
The goal is to find the sweet spot.
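To make the trade-off concrete, here is a small illustrative simulation (not from the slides; the data-generating process and polynomial degrees are made up). Training error falls monotonically with complexity, while test error is U-shaped, bottoming out at the sweet spot:

```r
# Simulate y = sin(x) + noise, then fit polynomials of increasing degree.
set.seed(1)
x_train <- runif(50, 0, 2 * pi); y_train <- sin(x_train) + rnorm(50, sd = 0.3)
x_test  <- runif(50, 0, 2 * pi); y_test  <- sin(x_test)  + rnorm(50, sd = 0.3)
mse <- sapply(1:9, function(d) {
  fit  <- lm(y_train ~ poly(x_train, d))
  pred <- predict(fit, newdata = data.frame(x_train = x_test))
  c(train = mean(residuals(fit)^2),     # re-substitution error: keeps falling
    test  = mean((y_test - pred)^2))    # out-of-sample error: U-shaped
})
round(mse, 3)
```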
We know how to add complexity:
Add more features.
Add nonlinear effects of and interactions between existing features.
But how do we make models simpler?
One approach is to combine features:
Principal components.
Partial least squares.
Another approach is to penalize for over-fitting \(\rightarrow\) regularization.
Both approaches involve tuning parameters.
Principal component analysis (PCA) is a data reduction technique.
Consider the placement of German political parties on three issues \(\rightarrow \mathbb{R}^3\).
Is there a \(\mathbb{R}^2\) or \(\mathbb{R}^1\) space that adequately accounts for the data?
Source: Analytics Vidhya
Transform a set of correlated features into a smaller set of uncorrelated principal components, ordered so that each successive component captures the largest possible share of the remaining variance.
Given the \(P \times P\) covariance matrix \(\boldsymbol{S}\), the total variance is equal to \(T = \operatorname{tr}(\boldsymbol{S}) = \sum_{j=1}^P s_{jj}\).
We perform the spectral decomposition
\[ \boldsymbol{S} = \boldsymbol{C} \boldsymbol{L} \boldsymbol{C}^\top = \sum_{j=1}^P \lambda_j \boldsymbol{c}_j \boldsymbol{c}_j^\top \]
Here,
\(\boldsymbol{L}\) is a diagonal matrix of eigenvalues, whose elements are \(\lambda_j\).
\(\boldsymbol{C}\) is a \(P \times P\) orthogonal matrix of eigenvectors or loadings, whose rows correspond to the features and whose columns correspond to the components.
Mathematically, \(\sum_{j=1}^P \lambda_j = T\).
Thus, the \(i\)th component accounts for \(100 \cdot \frac{\lambda_i}{T}\) percent of the total variance.
We typically retain the \(M < P\) components that account for most of the variance.
We treat \(M\) as a tuning parameter.
Instance \(i\)’s score on the \(m\)th principal component is
\[ p_{im} = \sum_{p=1}^P c_{pm} x_{ip} \qquad(1)\] where
\(i\) indexes an instance
\(m\) indexes a component
\(p\) indexes the feature
\(c_{pm}\) is the loading
In matrix terms,
\[ \boldsymbol{P} = \boldsymbol{X} \boldsymbol{C} \qquad(2)\]
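As a quick sanity check of Equation 2 (a sketch, assuming the party placements are stored in a data frame `german_df`, as in the code further below; `prcomp()` centers the features by default):

```r
# Spectral decomposition via prcomp: rotation is C, x holds the scores P.
pca_fit <- prcomp(german_df)                         # no scaling, covariance-based
X_centered <- scale(german_df, center = TRUE, scale = FALSE)
P_scores <- X_centered %*% pca_fit$rotation          # P = XC
all.equal(as.vector(P_scores), as.vector(pca_fit$x)) # identical up to rounding
# The eigenvalues sum to the total variance T:
sum(pca_fit$sdev^2)
sum(diag(cov(german_df)))
```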
We can use the loadings to make sense of the principal components.
However, the loadings are not unique.
We can account for the same portion of \(T\) through arbitrary orthogonal transformations of \(\boldsymbol{C}\) that we call rotations.
Rotation is our friend—a carefully selected rotation can simplify the interpretation.
| | EU | spending | immigration |
|---|---|---|---|
| CDU | 6.38 | 5.73 | 5.73 |
| SPD | 6.38 | 3.45 | 3.91 |
| FDP | 5.69 | 7.36 | 3.60 |
| Gruenen | 6.23 | 2.82 | 2.09 |
| Linke | 3.00 | 0.50 | 4.00 |
| AfD | 4.85 | 6.18 | 7.45 |
| CSU | 1.62 | 7.89 | 9.30 |
Standard deviations (1, .., p=3):
[1] 3.4172813 2.1019527 0.9245449
Rotation (n x k) = (3 x 3):
PC1 PC2 PC3
EU -0.2924737 -0.6993803 0.6521705
spending 0.6719890 -0.6355302 -0.3801739
immigration 0.6803602 0.3270605 0.6558517
The first component explains 68.9% of the total variance, the second 26.1%, and the third 5.0%.
Loadings:
PC1 PC2
EU 0.268 -0.709
spending 0.925
immigration 0.270 0.705
PC1 PC2
SS loadings 1.000 1.000
Proportion Var 0.333 0.333
Cumulative Var 0.333 0.667
The 1st component is mostly about spending and thus economic in nature. The 2nd component is about open versus closed borders.
# Done here by hand, but can be done more easily using the psych package.
# Step 1: Extract original loadings for PC1 and PC2.
L <- pca_fit$rotation[,1:2]
# Step 2: Extract rotation matrix from varimax rotation.
W <- pca_rotate$rotmat
# Step 3: Re-create rotated loadings.
M <- L %*% W
# Step 4: Center the features column-wise. (Note: subtracting colMeans()
# directly would recycle the means down the rows, which is incorrect.)
X_centered <- scale(as.matrix(german_df), center = TRUE, scale = FALSE)
# Step 5: Generate scores.
scores <- X_centered %*% M
scores
              [,1]       [,2]
CDU 1.374228 -0.6587567
SPD -1.426580 -1.7698443
FDP 2.102471 -1.2593214
Gruenen -2.338837 -3.1188807
Linke -5.036796 0.6904769
AfD 1.823888 2.0510158
CSU 3.058649 5.2337300
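As the code comment above notes, the psych package wraps these steps; a sketch (note that `principal()` works with the correlation matrix unless `covar = TRUE`, and returns standardized regression scores, so the numbers will not match the hand computation exactly):

```r
# One-call alternative: extraction, varimax rotation, and scoring via psych.
library(psych)
pca_psych <- principal(german_df, nfactors = 2, rotate = "varimax", covar = TRUE)
pca_psych$loadings  # rotated loadings
pca_psych$scores    # component scores
```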
We somewhat haphazardly settled on 2 components based on explained variance.
In machine learning, we would generally rely on cross-validation to find the optimal number of components from the perspective of predictive performance.
A tuning parameter (a.k.a. hyper-parameter) is a parameter that affects the operation of the algorithm but cannot be estimated from the data.
Oftentimes, these tuning parameters concern model complexity.
An example is \(M\), the number of principal components to be retained.
Although the modeler ultimately sets the value of the tuning parameter, it is possible to let the data speak to that decision \(\rightarrow\) (cross-)validation.
Re-sampling is an out-of-sample method to assess whether results will generalize.
The alternative is re-substitution—measure performance on the same data used to optimize performance in the first place.
The resulting error is known as the re-substitution error.
Re-substitution errors over-estimate performance (Efron, 1983).
This is why we look for the alternative of re-sampling.
We can further sub-divide the training set using a number of re-sampling approaches:
In each case, we generate out-of-sample estimates of performance.
A discussion of the pros and cons of the various methods can be found in Molinaro et al. (2005).
Earlier, we saw how to split the data into training and test sets.
We can divide the training set again, resulting in two components:
The training set proper
The validation set
The validation set allows us to assess how choices about tuning parameters in the training process generalize before we deploy the tuned model on our test set.
library(rio)
library(tidymodels)
library(tidyverse)
tidymodels_prefer()
happy_df <- import("Data/whr23.dta")
row.names(happy_df) <- happy_df$country
happy_clean <- happy_df %>%
select(happiness,logpercapgdp,socialsupport,healthylifeexpectancy,
freedom,generosity,perceptionsofcorruption) %>%
na.omit
set.seed(10)
happy_split <- initial_split(happy_clean, prop = 0.6, strata = happiness)
happy_train <- training(happy_split)
happy_test <- testing(happy_split)

The split sample approach has a couple of disadvantages:
We train on a potentially small fraction of the original data, which induces a downward bias in performance.
We obtain only one value for the optimal tuning parameter, which means our sense of generalizability is limited.
If we have a lot of data, these problems are typically not all that severe.
In smaller data sets, we should and can do better through cross-validation or bootstrapping.
Randomly assign instances to \(k\) folds (typically, \(k = 10\)).
Each fold is held out once for validation purposes.
Each fold is used \(k-1\) times for training purposes.
The higher we set \(k\), the more data will be used for training at any given time.
An extreme case is leave-one-out CV (LOOCV), where \(k = n_1\), i.e., each fold contains a single instance.
In each run, bias is reduced but variance is increased.
Repeatedly assigning instances to the folds, each time using a different seed (repeated CV), reduces the variance of the performance estimate.
Monte Carlo CV selects the folds randomly each time, which may cause folds to contain overlapping instances (Xu & Liang, 2001).
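The resampling objects printed below can be created with rsample functions along these lines (a sketch; the seed is an assumption):

```r
# Four resampling schemes on the 80-row happiness training set.
set.seed(10)
cv_folds    <- vfold_cv(happy_train, v = 10)               # 10-fold CV
cv_repeated <- vfold_cv(happy_train, v = 10, repeats = 5)  # repeated 10-fold CV
cv_loo      <- loo_cv(happy_train)                         # leave-one-out CV
cv_mc       <- mc_cv(happy_train, prop = 0.9, times = 20)  # Monte Carlo CV
```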
# 10-fold cross-validation
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [72/8]> Fold01
2 <split [72/8]> Fold02
3 <split [72/8]> Fold03
4 <split [72/8]> Fold04
5 <split [72/8]> Fold05
6 <split [72/8]> Fold06
7 <split [72/8]> Fold07
8 <split [72/8]> Fold08
9 <split [72/8]> Fold09
10 <split [72/8]> Fold10
# 10-fold cross-validation repeated 5 times
# A tibble: 50 × 3
splits id id2
<list> <chr> <chr>
1 <split [72/8]> Repeat1 Fold01
2 <split [72/8]> Repeat1 Fold02
3 <split [72/8]> Repeat1 Fold03
4 <split [72/8]> Repeat1 Fold04
5 <split [72/8]> Repeat1 Fold05
6 <split [72/8]> Repeat1 Fold06
7 <split [72/8]> Repeat1 Fold07
8 <split [72/8]> Repeat1 Fold08
9 <split [72/8]> Repeat1 Fold09
10 <split [72/8]> Repeat1 Fold10
# ℹ 40 more rows
# Leave-one-out cross-validation
# A tibble: 80 × 2
splits id
<list> <chr>
1 <split [79/1]> Resample1
2 <split [79/1]> Resample2
3 <split [79/1]> Resample3
4 <split [79/1]> Resample4
5 <split [79/1]> Resample5
6 <split [79/1]> Resample6
7 <split [79/1]> Resample7
8 <split [79/1]> Resample8
9 <split [79/1]> Resample9
10 <split [79/1]> Resample10
# ℹ 70 more rows
# Monte Carlo cross-validation (0.9/0.1) with 20 resamples
# A tibble: 20 × 2
splits id
<list> <chr>
1 <split [72/8]> Resample01
2 <split [72/8]> Resample02
3 <split [72/8]> Resample03
4 <split [72/8]> Resample04
5 <split [72/8]> Resample05
6 <split [72/8]> Resample06
7 <split [72/8]> Resample07
8 <split [72/8]> Resample08
9 <split [72/8]> Resample09
10 <split [72/8]> Resample10
11 <split [72/8]> Resample11
12 <split [72/8]> Resample12
13 <split [72/8]> Resample13
14 <split [72/8]> Resample14
15 <split [72/8]> Resample15
16 <split [72/8]> Resample16
17 <split [72/8]> Resample17
18 <split [72/8]> Resample18
19 <split [72/8]> Resample19
20 <split [72/8]> Resample20
The bootstrap consists of repeatedly sampling \(n_1\) instances with replacement (Efron, 1979).
Sampled instances are used for training.
Non-sampled instances are used for validation.
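The bootstrap resamples printed below can be generated with rsample's `bootstraps()` (a sketch; the seed is an assumption). The varying validation-set sizes reflect how many instances happen to be left out of each bootstrap sample:

```r
# 100 bootstrap samples; the out-of-bag instances form the validation sets.
set.seed(10)
boot_samples <- bootstraps(happy_train, times = 100)
```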
# Bootstrap sampling
# A tibble: 100 × 2
splits id
<list> <chr>
1 <split [80/29]> Bootstrap001
2 <split [80/34]> Bootstrap002
3 <split [80/27]> Bootstrap003
4 <split [80/25]> Bootstrap004
5 <split [80/26]> Bootstrap005
6 <split [80/28]> Bootstrap006
7 <split [80/27]> Bootstrap007
8 <split [80/26]> Bootstrap008
9 <split [80/24]> Bootstrap009
10 <split [80/29]> Bootstrap010
# ℹ 90 more rows
For each value of a tuning parameter, the model needs to be re-run multiple times.
This can be extremely slow for complex models.
This is why a single validation set, rather than cross-validation, is often used in deep learning.
The problem of over-fitting can be stated as: \(P\) weights are too many for the task.
We have seen that we can reduce \(P\) predictive features to \(M < P\) principal components.
If we use the components as the predictive features, then this should decrease the risk of over-fitting the data.
This is known as principal component regression.
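A minimal sketch of principal component regression in base R, reusing the `happy_train` data from above (the choice of \(M = 2\) is purely illustrative):

```r
# Principal component regression: replace the P features with M component
# scores, then regress the label on the scores.
pca_fit <- prcomp(subset(happy_train, select = -happiness), scale. = TRUE)
M <- 2  # in practice, tune M via cross-validation
pcr_df <- data.frame(happiness = happy_train$happiness,
                     pca_fit$x[, 1:M])
pcr_fit <- lm(happiness ~ ., data = pcr_df)
```

Note that `prcomp()` never sees `happiness`; this is exactly the label-blindness that PLS addresses.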
From a predictive modeling perspective, the problem with PC regression is that it never considers the label when constructing component scores.
This means that PC regression is not designed to optimize the prediction of the label.
We can change that by not just considering the total variance of the predictive features, but also the covariance of those features with the label.
Partial least squares (PLS) allows this (Liu et al., 2022; Wold et al., 2001).
For the \(m\)th component,
\[ \max_{\boldsymbol{\omega}_m} \text{Cor}^2(\boldsymbol{y},\boldsymbol{X}\boldsymbol{\omega}_m) \text{Var}(\boldsymbol{X}\boldsymbol{\omega}_m) \qquad(3)\]
subject to the constraint that \(\boldsymbol{\omega}_m\) has unit norm and is orthogonal to the previous weight vectors.
As in PC regression, \(M\) is set through cross-validation. It is standard practice to standardize the predictive features.
Grid searches:
We pre-define a set of values that should be evaluated.
Can be inefficient when the number of parameters is large.
Can itself be broken down into regular and irregular grids.
Iterative searches:
We sequentially obtain new parameter combinations based on prior results.
Efficient in terms of number of passes, but each pass may take more time.
Can itself be broken down into Bayesian optimization and simulated annealing.
For our simple problem, we do a regular grid search; other approaches will be discussed later.
happiness logpercapgdp socialsupport healthylifeexpectancy
Bangladesh 4.282 8.685 0.544 64.548
Chad 4.397 7.261 0.722 53.125
Comoros 3.545 8.075 0.471 59.425
Congo (Kinshasa) 3.207 7.007 0.652 55.375
Egypt 4.170 9.367 0.726 63.503
Gambia 4.279 7.648 0.584 57.900
freedom generosity perceptionsofcorruption
Bangladesh 0.845 0.005 0.698
Chad 0.677 0.221 0.807
Comoros 0.470 -0.014 0.727
Congo (Kinshasa) 0.664 0.086 0.834
Egypt 0.732 -0.183 0.580
Gambia 0.596 0.364 0.883
final_recipe <- recipe(happiness ~ ., data = happy_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_pls(all_numeric_predictors(), outcome = "happiness", num_comp = 1)
pls_prep <- prep(final_recipe)
tidied_pls <- tidy(pls_prep, 2)
tidied_pls %>%
group_by(component) %>%
slice_max(abs(value), n = 5) %>%
ungroup() %>%
ggplot(aes(value, fct_reorder(terms,value))) +
geom_col(show.legend = FALSE, fill = "#31688EFF") +
facet_wrap(~ component, scales = "free_y") +
labs(y = NULL) +
theme_bw()

Regularization adds a penalty for over-fitting to the loss function.
This penalty serves as a constraint on the optimization problem.
We can also think of this as a shrinkage estimator: coefficients are shrunk toward zero, with small coefficients shrunk the most.
The most important regularization procedures are
The lasso.
Tikhonov regularization.
Elastic nets.
Lasso = least absolute shrinkage and selection operator.
We impose \(\sum_{j=1}^P \lvert \omega_j \rvert \leq t\).
Using Lagrange multipliers, the lasso loss is \(L_{\text{lasso}} = L_2 + \lambda \left( \sum_{j=1}^P \lvert \omega_j \rvert - t \right)\).
Retaining only the terms involving \(\boldsymbol{\omega}\), the optimization problem is
\[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda \sum_{j=1}^P \lvert \omega_j \rvert \qquad(4)\]
\(\lambda\) is a tuning parameter.
We can think of the lasso as a special case of
\[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda \sum_{j=1}^P \lvert \omega_j \rvert^q \]
For \(q = 1\), this becomes the lasso.
For \(q = 2\), this becomes Tikhonov regularization (a.k.a. ridge regression).
We minimize the squared-error loss subject to \(\sum_{j=1}^P \omega_j^2 \leq t\).
Hence,
\[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda \sum_{j=1}^P \omega_j^2 \qquad(5)\]
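Unlike the lasso, ridge regression has a closed-form solution. A sketch (`ridge_coef` is a hypothetical helper, not part of the lecture code; it assumes standardized features and a centered label):

```r
# Closed-form ridge solution: (X'X + lambda * I)^{-1} X'y.
ridge_coef <- function(X, y, lambda) {
  X <- scale(X)                 # standardize features
  y <- y - mean(y)              # center label, so no intercept is needed
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
# lambda = 0 reproduces OLS on the standardized features; as lambda grows,
# the coefficients are shrunk toward zero (but never exactly to zero).
```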
Elastic nets are a mixture of the lasso and Tikhonov regularization.
For \(\alpha \in [0,1]\), define the penalty
\[ P(\alpha) = \sum_{j=1}^P \left[ \frac{1}{2} (1 - \alpha) \omega_j^2 + \alpha \lvert \omega_j \rvert \right] \qquad(6)\]
Then
\[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda P(\alpha) \qquad(7)\]
Note this reduces to
the lasso for \(\alpha = 1\)
Tikhonov regularization for \(\alpha = 0\)
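A quick numerical check of these special cases (`elastic_penalty` is a hypothetical helper implementing Equation 6):

```r
# Elastic net penalty P(alpha) for a coefficient vector omega.
elastic_penalty <- function(omega, alpha) {
  sum(0.5 * (1 - alpha) * omega^2 + alpha * abs(omega))
}
omega <- c(-2, 0.5, 1)
elastic_penalty(omega, alpha = 1) == sum(abs(omega))     # TRUE: lasso penalty
elastic_penalty(omega, alpha = 0) == 0.5 * sum(omega^2)  # TRUE: ridge penalty
```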
From PLS, we recycle:
The split sample.
The cross-validation setup.
The standardization.
We use dials for tuning instead of the tibble we used for PLS. We use a regular grid.
model_recipe <- recipe(happiness ~ ., data = happy_train) %>%
step_normalize(all_numeric_predictors())
elastic_spec <- linear_reg(penalty = tune(),
mixture = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
elastic_wf <- workflow() %>%
add_model(elastic_spec) %>%
add_recipe(model_recipe)
elastic_metrics <- metric_set(rsq)

glmnet_param <- extract_parameter_set_dials(elastic_spec)
elastic_grid <- grid_regular(glmnet_param, levels = 5)
print(elastic_grid, n = 6)

# A tibble: 25 × 2
penalty mixture
<dbl> <dbl>
1 0.0000000001 0.05
2 0.0000000316 0.05
3 0.00001 0.05
4 0.00316 0.05
5 1 0.05
6 0.0000000001 0.288
# ℹ 19 more rows
tidymodels shines in its handling of tuning parameters, allowing for optimized experimental designs. Here, we shall use a Latin hypercube, which ensures that we fill the entire tuning parameter space.
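A sketch of what that could look like with dials, reusing the `glmnet_param` parameter set from above (the size is an assumption):

```r
# Space-filling irregular grid: 25 penalty/mixture combinations drawn via a
# Latin hypercube instead of the regular 5 x 5 grid used earlier.
set.seed(10)
lhc_grid <- grid_latin_hypercube(glmnet_param, size = 25)
```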